Information sharing in high-dimensional gene expression data for improved parameter estimation in concentration-response modelling

In toxicological concentration-response studies, a frequent goal is the determination of an ‘alert concentration’, i.e. the lowest concentration where a notable change in the response in comparison to the control is observed. In high-throughput gene expression experiments, e.g. based on microarray or RNA-seq technology, concentration-response profiles can be measured for thousands of genes simultaneously. One approach for determining the alert concentration is given by fitting a parametric model to the data which allows interpolation between the tested concentrations. It is well known that the quality of a model fit improves with the number of measured data points. However, adding new replicates for existing concentrations or even several replicates for new concentrations is time-consuming and expensive. Here, we propose an empirical Bayes approach to information sharing across genes, where in essence a weighted mean of the individual estimate for one specific parameter of a fitted model and the mean of all estimates of the entire set of genes is calculated as a result. Results of a controlled plasmode simulation study show that for many genes a notable improvement in terms of the mean squared error (MSE) between estimate and true underlying value of the parameter can be observed. However, for some genes, the MSE increases, and this cannot be prevented by using a more sophisticated prior distribution in the Bayesian approach.

In the point-by-point letter the comments of the reviewers are printed in bold, followed by our answers. The individual points raised in the separate file by Reviewer #3 are pasted into this letter separately, followed by our answers and references to changes in the manuscript.
We are looking forward to your response.

Point-by-point letter:
Reviewer #1: In the current era, simulation studies are growing in the field of biological studies. Yet, it's still a growing field. In the same sense, the presented paper proposes a new method to improve the estimation of minimal toxicity. The authors used an approach where information is shared across genes which allows the relaxation of common parameters and hence allows some improvements in the statistical analyses. Empirical Bayes approaches were used to carry out the study. Results were clearly detailed and explained. The overall notion shows significant merit in the application of robust Bayes and mixing distribution Bayes approaches for acquiring more information from certain data. Indeed, this could be useful for several biological models.
Thank you for this positive assessment of the work.

Reviewer #2:
Your approach to information sharing in high-dimensional gene expression data is both innovative and insightful. The utilization of concentration-response modeling, coupled with the incorporation of information sharing techniques, showcases your expertise in handling complex biological datasets. The results you have obtained demonstrate the potential for improved parameter estimation, which holds promising implications for advancing our understanding of gene expression patterns and their regulatory mechanisms. I commend your meticulous data analysis and the clarity with which you have presented your findings. Your attention to detail and rigorous methodology inspire confidence in the validity and reliability of your research outcomes. Furthermore, your ability to effectively communicate complex concepts ensures that your work can be understood and appreciated by both fellow researchers and non-experts alike.
In addition to the scientific rigor, your work also exemplifies your passion for the subject matter. It is evident that you possess a genuine curiosity for unraveling the intricacies of gene expression, and this enthusiasm shines through in every aspect of your research.

Reviewer #3:
The paper titled "Information sharing in high-dimensional gene expression data for improved parameter estimation in concentration-response modelling" utilizes a four-parameter logistic function to model toxicological concentration-response, a model that was previously proposed. The novelty of this work lies in the application of a Bayesian approach to obtain probabilities (posterior distributions fed by previous information). This approach considers the alert concentration EC50 as a parameter. Ultimately, the results are compared with those obtained using other three different approaches.
Considering that PLOS ONE objectively focuses on the technical aspects of a study rather than subjective evaluations, and after having read the paper I recommend rejecting the current manuscript. My reasons for this recommendation are attached in a file.
We included the remarks from the file in this letter and answer the points individually.
The context of the problem and the proposed method is specifically identified in the abstract and the Section Introduction. In the corresponding first sentences, toxicological research in the context of gene expression analysis is identified as the application for the proposed method.
Regarding the understanding and interpretation of EC values, we added additional explanatory sentences and paragraphs in three positions of the Section Introduction.
First, a general definition of EC values is given with the following statement: "For cell viability experiments, where the measured response is given by some percentage, EC values are typically calculated in an absolute way, i.e. as the concentration, where the fitted curve attains the specific pre-defined percentage. In applications such as gene expression data, where the responses themselves do not correspond to percentages, EC values are calculated in a relative way. In these cases, a certain percentage of the overall effect, i.e. the difference in the response values between the highest and lowest measured concentration (or between the two asymptotes of a fitted model), needs to be attained by the fitted curve."
Second, we added the following explanation to the formal explanation of the EC50 in our context in the Introduction: "In the situation of gene expression as response data, as considered in this work, the EC50 is to be understood in a relative way."
Third, we provided context on the cited reference where a similar method has been used for ED90 values instead of EC50 values: "Such an approach has been used before in [8], in a simpler way, where a mixture of two normal distributions was fitted for ED90 (here referring to doses instead of concentrations) values on log-scale, where the response value was given by the biomass of plants after treatment with different doses of a herbicide."
We also added the definition of the MSE as mean squared error to the abstract.
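The distinction between absolute and relative EC values quoted above can be illustrated with a minimal sketch. It assumes a common parameterization of the 4pLL model, f(x) = c + (d - c) / (1 + exp(b (log x - log e))), which may differ from the one used in the manuscript; the function names and the numeric values in the usage note are hypothetical and serve illustration only.

```python
import math


def fpl(x, b, c, d, e):
    """4pLL curve under the assumed parameterization.

    c and d are the two asymptotes, b is the slope, e is the (relative) EC50.
    """
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))


def relative_ec(p, b, e):
    """Concentration at which a fraction p (0 < p < 1) of the overall effect,
    measured from the asymptote at concentration zero, is attained."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return e * (p / (1.0 - p)) ** (1.0 / abs(b))


def absolute_ec(y, b, c, d, e):
    """Concentration at which the fitted curve attains the response value y."""
    if not min(c, d) < y < max(c, d):
        raise ValueError("y must lie strictly between the two asymptotes")
    return e * ((d - y) / (y - c)) ** (1.0 / b)
```

For a viability-like curve with asymptotes 0 and 100, `absolute_ec(90.0, b=2.0, c=0.0, d=100.0, e=3.0)` asks where the curve crosses 90 percent. For gene expression data only `relative_ec` is meaningful, and `relative_ec(0.5, b, e)` returns the parameter e itself, which is why the EC50 can be read directly from the fitted model.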
There is a small typo in your formula; the correct formula is: $p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$. Please note that, in our manuscript, instead of y, the letter x is used.
The prior refers to the distribution of the parameter ẽ only, since this is the only parameter to which the Bayesian information sharing is applied. We added the sentence "Only this parameter from the 4pLL model is thus considered for the Bayesian information sharing" to the respective part of the Statistical Methods Section to underline this.
The likelihood is not $p(\theta \mid y)$ (which corresponds to the posterior), but $p(y \mid \theta)$ (see general statement about the Bayes formula above). As stated in the previous paragraph, the Bayesian approach is not applied to the entire model, but to the parameter ẽ only. So the likelihood is given by the formula $\hat{\tilde{e}} \mid \tilde{e} \sim N(\tilde{e}, \sigma^2)$, as explicitly stated in the manuscript in the Section Statistical Methods.
This is correct; however, it is not explicitly stated in the manuscript, since this is only a normalizing constant and does not need to be calculated for the Bayesian formula (which is typically formulated as a proportional relationship between posterior distribution on the one side and likelihood and prior on the other side).
The posterior distribution is explicitly stated in Equation (2) of the manuscript. This Equation explicitly refers to the different parameters. All information about the formulas used for the simulation is thus explicitly given, and in addition the R code for the simulation is made publicly available via GitHub: https://github.com/FKappenberg/Paper-InformationSharingAcrossGenes. We added a reference to this repository in the Section Software of the manuscript.
We did not perform a sensitivity analysis with respect to the choice of the prior. However, the assumption of a normal distribution is well-established in the fitting of concentration-response curves, and the prior parameters stem directly from the data via our empirical Bayes approach, such that we believe our prior is suitable.
Variable ẽ is the log-transformed EC50 parameter (see explanation directly below Equation (1) in the manuscript). Since the Bayesian information sharing approach is only conducted for this individual parameter and not for the entire model, the results only refer to the estimation of this parameter ẽ.
The advantage of the Bayesian normal-normal model as employed in this work is the closed form of the posterior distribution (see Equation (2)). Thus, this Bayesian approach works without any sampling, such that no convergence needs to be achieved and the posterior can be calculated directly. Figure 5 shows coverage probabilities for the confidence intervals based on the direct estimation and for the credibility intervals based on the Bayesian approach. It can be observed that the general improvement in terms of the mean squared error does not come at a cost of lower coverage probability. Thus we believe that our presented results provide a full picture of the simulation results.
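To illustrate why no sampling is required, the following minimal sketch computes the closed-form posterior of a generic normal-normal model. It is not the R code from the repository. The function names are hypothetical, and the simple plug-in prior (mean and sample standard deviation of all per-gene estimates) is an assumption that may differ from the estimator used in the manuscript.

```python
from statistics import NormalDist, mean, variance


def empirical_prior(estimates):
    """Assumed plug-in prior: mean and standard deviation of all per-gene estimates."""
    return mean(estimates), variance(estimates) ** 0.5


def normal_normal_posterior(estimate, se, prior_mean, prior_sd):
    """Closed-form posterior for one gene.

    Prior:      parameter ~ N(prior_mean, prior_sd^2)
    Likelihood: estimate | parameter ~ N(parameter, se^2)
    The posterior mean is a weighted mean of the individual estimate
    and the prior mean; no sampling or convergence check is involved.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = weight * estimate + (1.0 - weight) * prior_mean
    post_sd = (weight * se ** 2) ** 0.5
    return post_mean, post_sd


def credible_interval(post_mean, post_sd, level=0.95):
    """Equal-tailed credible interval of the normal posterior."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return post_mean - z * post_sd, post_mean + z * post_sd
```

A gene with a large standard error receives a small weight and is pulled strongly towards the prior mean, whereas a precisely estimated gene remains close to its individual estimate. The posterior standard deviation is never larger than the standard error of the individual estimate.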